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Abstract 

In Bayesian statistics, many problems can be expressed as the evaluation of the 
expectation of a quantity of interest with respect to the posterior distribution. Stan- 
dard Monte Carlo method is often not applicable because the encountered posterior 
distributions cannot be sampled directly. In this case, the most popular strategies 
. are the importance sampling method, Markov chain Monte Carlo, and annealing. In 

this paper, we introduce a new scheme for Bayesian inference, called Asymptotically 
Independent Markov Sampling (AIMS), which is based on the above methods. We 
derive important ergodic properties of AIMS. In particular, it is shown that, under 
certain conditions, the AIMS algorithm produces a uniformly ergodic Markov chain. 
The choice of the free parameters of the algorithm is discussed and recommenda- 

■ tions are provided for this choice, both theoretically and heuristically based. The 
i efficiency of AIMS is demonstrated with three numerical examples, which include 

both multi-modal and higher-dimensional target posterior distributions. 

» 
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oo . 1 Three cornerstones of computational Bayesian inference 
00 ■ 

In Bayesian statistics, many problems can be expressed as the evaluation of the expecta- 
tion of a quantity of interest with respect to the posterior distribution. Standard Monte 
Carlo simulation |MU49] . where expectations are estimated by sample averages based on 
samples drawn independently from the posterior, is often not applicable because the en- 
countered posterior distributions are multi-dimensional non-Gaussian distributions that 
^ ' cannot be explicitly normalized. In this case, the most popular strategies are importance 

■ sampling and Markov chain Monte Carlo methods. We briefly review these two methods 
first because they play an important role in the new MCMC method introduced in this 
paper. 

Importance sampling: This is nearly as old as the Monte Carlo method (see, for 
instance, |KM53] ). and works as follows. Suppose we want to evaluate E^[/i] that is an 
expectation of a function of interest /i : 6 — M under distributioiH 7r(-) defined on a 
parameter space B C M*^, 

E^[h] = I h{9)n{9)d6. (1) 
Je 

Suppose also that we are not able to sample directly from 7r(-), although we can compute 
IT {6) for any ^ G to within a proportionality constant. Instead, we sample from some 



^ Both authors contributed equally to this work. Corresponding author's email: zuev@caltech.edu. 

^Unless otherwise stated, all probability distributions are assumed to have densities with respect to 
Lebesgue measure, TT{d9) = n{6)d6. For simplicity, the same symbol will be used to denote both the 
distribution and its density, and we write 6 ^ 7r(-) to denote that 9 is distributed according to 7r(-). 
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other distribution g(-) on G which is readily computable for any 9 E Q. Let 9^^\ . . . , 9^^'> 
be N i.i.d. samples from g(-), and w'-*-' = 7r(6''^*))/g(6'^*)) denote the importance weight of 
the i^^ sample, then we can estimate ]E^[/i] by 

The estimator converges almost surely as — ?■ oo to Ejr[/i] by the Strong Law of 
Large Numbers for any choice of distribution g(-), provided supp(7r) C supp(g). Note 
that the latter condition automatically holds in Bayesian updating using data T> where 
q{9) = 7ro(^) is the prior density and 7r{9) oc tio{9)L{9) is the posterior p{9\'D), where L 
stands for the likelihood function p{T>\9). 

The estimator /i^v in (EJ generally has a smaller mean square error than a more straight- 
forward unbiased importance sampling estimator: 

1 ^ 

i=l 

This is especially clear when h is nearly a constant: ii h then c, while h'j^ has 

a larger variation. Although is biased for any finite A^, the bias can be made small 
by taking sufficiently large A^, and the improvement in variance makes it a preferred 
alternative to h'pf |Li01l IRC04j . Another major advantage of using instead of h'j^, 
which is especially important for Bayesian applications, is that in using the former we 
need to know 7r(6') only up to a multiplicative normalizing constant; whereas in the latter, 
this constant must be known exactly. 

The accuracy oi depends critically on the choice of the importance sampling dis- 
tribution (ISD) g(-), which is also called the instrumental or trial distribution. If g(-) is 
chosen carelessly such that the the importance weights w^^^ have a large variation, then 
fiN is essentially based only on the few samples 9^^^ with the largest weights, yielding 
generally a very poor estimate. Hence, for importance sampling to work efficiently, g(-) 
must be a good approximation of 7r(-) — "the importance sampling density should mimic 
the posterior density" |Ge89] — so that the variance var5[w] is not large. Since usually the 
prior and posterior are quite different, it is, therefore, highly inefficient to use the prior as 
the importance sampling distribution. When is high- dimensional, and 7r(-) is complex, 
finding a good importance sampling distribution can be very challenging, limiting the 
applicability of the method |AB03] . 

For the estimator h'j^ in ([3]), it is not difficult to show that the optimal importance 
sampling density, i.e., q*{-) that minimizes the variance of h'j^, is q*{9) oc \h{9)\7r{9). This 
result is sometimes attributed to Rubinstein |Ru81] . although it was proved earlier by 
Kahn and Marshall |KM53] . It is not true, however, that q*{-) is optimal for the estimator 
hj^f. Note also that this optimality result is not useful in practice, since when h{9) > 0, 
the required normalizing constant of q*{-) is h{9)7r{9)d9, the integral of interest. 

MCMC Sampling: Instead of generating independent samples from an ISD, we could 
generate dependent samples by simulating a Markov chain whose state distribution con- 
verges to the posterior distribution 7r(-) as its stationary distribution. Markov chain 
Monte Carlo sampling (MCMC) originated in statistical physics, and now is widely used 
in solving statistical problems [N^ I(;RS961 [LiOTl IR(]n4j . 
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The Metropolis-Hastings algorithm |MR^T^53i IHa70j . the most popular MCMC tech- 



nique, works as follows. Let q{-\0) be a distribution on 0, which may or may not 
depend on 6* G 0. Assume that q{-\6) is easy to sample from and it is either com- 
putable (up to a multiplicative constant) or symmetric, i.e. q{^\9) = q{9\^). The sam- 
pling distribution q{-\6) is called the proposal distribution. Starting from essentially any 
9^^^ E supp(7r), the Metropolis-Hastings algorithm proceeds by iterating the following 
two steps. First, generate a candidate state ^ from the proposal density q{-\9^"'^). Sec- 
ond, either accept ^ as the next state of the Markov chain, ^("+^) = ^, with probability 

a{^\9^"'^) = min |l, or reject and set 9^"''^^^ = 6"^"^ with the remaining 

probability 1 - It can be shown (see, for example, [RC04j ). that under fairly 

weak conditions, 7r(-) is the stationary distribution of the Markov chain 9^^\ 9^'^\ . . . and 
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1 ^ r 

-J]M^»)= / hi9)7ri9)d9. (4) 

i=i 



Since the chain needs some time (so called "burn-in" period) to converge to stationarity, 
in practice, an initial portion of, say, A^'o states is usually discarded and 
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V E (5) 



u i=No+l 

is used as an estimator for [/;,]. 

The two main special cases of the Metropolis-Hastings algorithm are Independent 
Metropohs-Hastings (IMH), where the proposal distribution q{C,\9) = gg(0 is independent 
of 9 (so qg is a global proposal), and Random Walk Metropolis-Hastings (RWMH), where 
the proposal distribution is of the form q{C,\9) = qi{^ — 9), i.e. a candidate state is proposed 
as ^ = 6''-"^ + en, where e„ ~ qi{-) is a random perturbation (so qi is a local proposal). In 
both cases, the choice of the proposal distribution strongly affects the efficiency of the 
algorithms. For IMH to work well, as with importance sampling, the proposal distribution 
must be a good approximation of the target distribution vr(-), otherwise a large fraction of 
the candidate samples will be rejected and the Markov chain will be too slow in covering 
the important regions for 7r(-). When, however, it is possible to find a proposal qg{-), such 
that qg{-) ~ 7r(-), IMH should always be preferred to RWMH because of better efficiency, 
i.e. better approximations of ]E^[/i] for a given number of samples A^. Unfortunately, such 
a proposal is difficult to construct in the context of Bayesian inference where the posterior 
7r(-) is often complex and high-dimensional. This limits the applicability of IMH. 

Since the random walk proposal qi{-) is local, it is less sensitive to the target distri- 
bution. That is why, in practice, RWMH is more robust and used more frequently than 
IMH. Nonetheless, there are settings where RWMH also does not work well because of 
the complexity of the posterior distribution. Although (jlj) is true in theory, a potential 
problem with RWMH (and, in fact, with any MCMC algorithm) is that the generated 
samples 9^^\ . . . ,9^^^ often consist of highly correlated samples. Therefore, the estima- 
tor /i AT in ([5]) obtained from these samples tends to have a large variance for a modest 
amount of samples. This is especially true when the posterior distribution contains sev- 
eral widely-separated modes: a chain will move between modes only rarely and it will 
take a long time before it reaches stationarity. If this is the case, an estimate produced 
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by /lAT will be very inaccurate. At first glance, it seems natural to generate several in- 
dependent Markov chains, starting from different random seeds, and hope that different 
chains will get trapped by different modes. However, multiple runs will not in general 
generate a sample in which each mode is correctly represented, since the probability of 
a chain reaching a mode depends more on the mode's "basin of attraction" than on the 
probability concentrated in the mode [Ne96] . 

Annealing: The concept of annealing (or tempering), which involves moving from 
an easy-to-sample distribution to the target distribution via a sequence of intermediate 
distributions, is one of the most effective methods of handling multiple isolated modes. 
Together with importance sampling and MCMC, annealing constitutes the third corner- 
stone of computational Bayesian inference. 

The idea of using the RWMH algor ithm in conjunction with annealing was introduced 
independently in |KGV83] and Ce85 for solving difficult optimization problems. The 
resulting algorithm, called Simulated Annealing, works as follows. Suppose we want to 
find the global minimum of a function of interest /i : O — t- M. This is equivalent to finding 
the global maximum of /r(^) = ^^p{—h{6)/T) for any given T > 0. By analogy with the 
Gibbs distribution in statistical mechanics, T is called the temperature parameter. Let 
To > Ti > . . . be a sequence of monotonically decreasing temperatures, in which Tq is 
large enough so that the probability distribution 71q{9) oc /ro(^) is close to uniform, and 
limj^oo Tj = 0. At each temperature Tj, the Simulated Annealing method generates a 
Markov chain with 7rj{6) oc exp{—h{6)/Tj) as its stationary distribution. The final state 
of the Markov chain at simulation level j is used as the initial state for the chain at level 
j + 1. The key observation is that for any function h such that jQexp{—h{6)/T)d6 < oo 
for all T > 0, distribution vr^ (■), as j increases, puts more and more of its probability mass 
(converging to 1) into a neighborhood of the global minimum of h. Therefore, a sample 
drawn from vrj(-) would almost surely be in a vicinity of the global minimum of h when 
Tj is close to zero. 

The success of Simulated Annealing in finding the global minimum crucially depends 
on the schedule of temperatures used in the simulation. It was proved in [GG84] that if 
a logarithmic schedule Tj = To/log(j + 1) is used, then, under certain conditions, there 
exists a value for Tq such that use of this schedule guarantees that the global minimum 
of h will be reached almost surely. In practice, however, such a slow annealing schedule 
is not computationally efficient. It is more common to use either a geometric schedule, 
Tj+i = ■yTj with < 7 < 1, or some adaptive schedule, which defines the temperature for 
the next annealing level based on characteristics of the samples observed at earlier levels. 
For examples of adaptive annealing schedules, see, for instance, |Ne93j . 

In Bayesian inference problems, the idea of annealing is typically employed in the 
following way. First, we construct (in advance or adaptively) a sequence of distributions 
VTo (■),..., vrm(-) interpolating between the prior distribution vro(-) and the posterior distri- 
bution 7r(-) = njn{-)- Next, we generate i.i.d. samples 6l^\ . . . , from the prior, which 
is assumed to be readily sampled. Then, at each annealing level j, using some MCMC 
algorithm and samples , • • • , Oj^\ from the previous level j — 1 , we generate samples 

ef\...,ef^ which are approximately distributed according to vrj(-). We proceed sequen- 
tially in this way, until the posterior distribution has been sampled. The rationale behind 
this strategy is that sampling from the multi-modal and, perhaps, high-dimensional pos- 
terior in such a way is likely to be more efficient than a straightforward MCMC sampling 
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of the posterior. 

The problem of samphng a complex distribution is encountered in statistical mechan- 
ics, computational Bayesian inference, scientific computing, machine learning, and other 
fields. As a result, many different efficient algorithms have been recently developed, e.g. 
the method of Simulated Tempering |MP92l IGT95] . the Tempered Transition method 
|Ne96] . Annealed Importance Sampling |Ne01] . the Adaptive Metropolis-Hastings algo- 
rithm |BA02] . Transitional Markov Chain Monte Carlo method |CC07] . to name a few. 

In this paper we introduce a new MCMC scheme for Bayesian inference, called Asymp- 
totically Independent Markov Sampling (AIMS), which combines the three approaches 
described above — importance sampling, MCMC, and annealing — in the following way. 
Importance sampling with 7rj_i(-) as the ISD is used for a construction of an approxima- 
tion 7rj^(-) of 7rj(-), which is based on samples 6*]^^, • • • , Oj^l ~ This approximation 
is then employed as the independent (global) proposal distribution for sampling from 7rj{-) 
by the IMH algorithm. Intermediate distributions vro(-), . . . ,vrm(-) interpolating between 
prior and posterior are constructed adaptively, using the essential sample size (ESS) to 
measure how much 7rj„i(-) differs from vrj(-). When the number of samples — )■ oo, the 
approximation ^j^(-) converges to vrj(-), providing the optimal proposal distribution. In 
other words, when N ^ oo, the corresponding MCMC sampler produces independent 
samples, hence the name of the algorithm. 

Remark 1 . The term "Markov sampling" has several different meanings. In this paper it 
is used as synonymous to "MCMC sampling" . 

In this introductory section, we have described all the main ingredients that we will 
need in the subsequent sections. The rest of the paper is organized as follows. In Section [21 
the AIMS algorithm is described. The ergodic properties of AIMS are derived in Section 
[3l The efficiency of AIMS is illustrated in Section H] with three numerical examples 
that include both multi-modal and high- dimensional posterior distributions. Concluding 
remarks are made in Section O 

2 Asymptotically Independent Markov Sampling 

Let 7ro(-) and 7r(-) be the prior and the posterior distributions defined on a parameter 
space 0, respectively, so that, according to Bayes' Theorem, 7i{9) oc 7io{9)L{9), where L 
denotes the likelihood function for data V. Our ultimate goal is to draw samples that are 
distributed according to 7r(-). 

In Asymptotically Independent Markov Sampling (AIMS), we sequentially generate 
samples from intermediate distributions 7ro(-), . . . ,7Tm{') interpolating between the prior 
7ro(-) and the posterior 7r(-) = 7rm(-)- The sequence of distributions could be specially 
constructed for a given problem but the following scheme [NeOH ICC07] generally yields 
good efficiency: 

7r,(^) (X 7^o(^)L(^)'^^ (6) 

where = /3o < /3i < • • • < /3m = 1- We will refer to j and Pj as the annealing level and 
the annealing parameter at level j, respectively. In the next subsection, we assume that 
Pj is given and therefore the intermediate distribution 7rj{-) is also known. In Subsection 
\2.2\ we describe how to choose the annealing parameters adaptively. 
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2.1 AIMS at annealing level j 

Our first goal is to describe how AIMS generates sample 6^^\ . . . , 6^^'^ from ttj^-) based 

on the sample ~ 7rj„i(-) obtained at the previous annealing level. We 

start with an informal motivating discussion that leads to the simulation algorithm. In 
Section [3], we rigorously prove that the corresponding algorithm indeed generates samples 
which are asymptotically distributed according to vrj(-), as the sample size Nj — )■ oo. 
Moreover, the larger iVj_i, the less correlated generated samples 9^^\ . . . ,9^^'^ are — a 
very desirable, yet rarely affordable, property for any MCMC algorithm. 

Let Kj{-\-) be any transition kernel such that vrj(-) is a stationary distribution with 
respect to Kj{-\-). By definition, this means that 

n,ie)de= I K,{de\Orr,{m (7) 

Applying importance sampling with the sampling density 7rj_i(-) to integral ([7]), we have: 

N,^, (8) 



i=l 



where ttj ^ (■) will be used as the global proposal distribution in the Independent Metropolis- 
Hastings algorithm, and 



f^. = ^^o:L(0fl^^.-'.-' and '^fl, = ^.j^ (9) 



are the importance weights and normalized importance weights, respectively. Note that 
to calculate Wj'li, we do not need to know the normalizing constants of 7rj_i(-) and vrj(-). 
If adjacent intermediate distributions 7rj_i(-) and 7ij{-) are sufficiently close (in other 
words, if APj = 13 j — is small enough), then the importance weights ([2]) will not vary 
wildly, and, therefore, we can expect that, for reasonably large Nj^i, approximation ([8]) 
is accurate. 

Remark 2. In [CBlOj . the stationary condition ([7]) was used for an analytical approxima- 
tion of the target PDF to evaluate the evidence (marginal likelihood) for a model. 

Remark 3. Note that for any finite Nj^i, distribution vrj^^"^(-) will usually have both 
continuous and discrete parts. This follows from the fact that the transition kernel in 
Markov chain simulation usually has the following form: K{d6\^) = k{6\C,)d6 + r{^)S^{d6), 
where k{-\-) describes the continuous part of the transition kernel, 6^{-) denotes the Dirac 
mass at ^, and r(^) = 1-/0 k{6\C,)d6. This is the form, for example, for the Metropolis- 
Hastings algorithm. Therefore, ([8]) must be understood as the approximate equality 
of distributions, not densities. In other words, ([HD means that E ^j-i [^] ~ [^] and 

E,]Vj_i [h] —7- E^^. [/i], when Nj^i — )■ 00, for all integrable functions h. See also Example 2.1 

below. 
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From now on, we consider a special case where Kj{-\-) is the random walk Metropolis- 
Hastings (RWMH) transition kernel. In this case, it can be written as follows: 

K,{de\0 = min |l, ^| + (1 - aj{0)S^{de), (10) 

where qj{-\C) is a symmetric local proposal density, and aj{^) is the probability of having 
a proper transition ,^ to \ {^}: 

a,{0 = J^qAm^^{l,^]de (11) 

Example 2.1. As a simple illustration of (jH]), consider the case when 7rj(-) = A/'(-|0, 1), 
7rj_i(-) = A/'(-|0, 2), and qj{-\E,) = M{-\C,,l/2), where M{-\fi,a^) denotes the Gaussian 
density with mean fi and variance a^. The approximation tTj ^"^(■) based on the samples 

Ofli, . . . ,9j^{'^^ ~ A/'(-|0,2) is shown in the top panels of Figure [H for A^j-i = 5 and 

Nj-i = 50. Suppose that hi{9) = 9 and h2{9) = 9^ are the functions of interest. Then 

E^J^i] = and E^J/ig] = 1- The convergence of /iJ(iV,„i) = E^N^_-,[hi] and hl{Nj^i) = 

"-J 

E-Afj-i [h2] is shown in the bottom panel of Figure [H 

"-J 

For sampling from vrj(-), we will use the Independent Metropolis-Hastings algorithm 
(IMH) with the global proposal distribution Tf^^'^{-). To accomplish this, we have to be 
able to calculate the ratio 7r^'~^ (9) /fr^^'^ {^) for any 9,^ G 6 as a part of the expression 

already mentioned, the distribution Tij^^^{-) does not have a density since it has both 

continuous and discrete components, and, therefore, the ratio t^^^'^ {9) /ti^^^^ makes 
no sense. To overcome this "lack-of-continuity problem", taking into account (|H]) and 
ffTOj) . let us formally define the global proposal distribution over 6 as: 

^?-iO) = E u^^,i9\9^;) mm (l, ^1 , (12) 

if^^{^fJi,...,^fr^},and 

7rf-^(^fJi) = oo (13) 

Note that ti^^'^{-) is a distribution on 6, but it does not have a density. However, 7r^^"^(-) 
induces another distribution on \ • • • , which does have a density, given 

by the r.h.s. of f fT2|) . This motivates f|T2|) . 



for the acceptance probability aj{^\9) = min <| 1, ^Nj_-^ — 7 (■ However, as it has been 



Now, using ( fT2l) and ( lT3l) . we can calculate the ratio vr^-' ^{9)/ti^^ ^(^) as follows: 



I. Ife,e^{^fi,...,^fr)},then 



tt/ (0 















^,*(fl'';'2,)mm] 









(14) 
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II. If e i \e% . . . , efx"'^^ and i = ef\, then 



2 



III. If e = ef\ and e i [ef\^ . . . , efs'^^ then 



and aj{i\e) = (15) 



— -- = oo and aJne) = 1 (16) 

IV. If ^ = 9f\ and e = then %[ is not defined. 

Notice that in the first three cases the ratio if^^'^ {6)/Tt^^^^{^) is readily computable, 
while in Case IV, it is not even defined. Therefore, it is very desirable to avoid Case 
IV. The key observation that allows us to do this is the following: suppose that the 



initial state 9j of the Markov chain that is generated is such that 9jEQ* = Q\ 
[of}^,...,e\^{-'^Y then ef G e* for alH > 1. Indeed, the only way for the chain to 

enter the set . . . , 6'j^^^"^''| is to generate a candidate state ^ G j^'j-L^, • • • , 

however, according to Case II, such a candidate will always be rejected. Thus, by replacing 
the state space by 9* and using ( !T4l) and ( ITSl) for evaluation of ttj ^'^{O)/^^ ^"^(O? 



N • 1 



are able to calculate the acceptance probability aj{^\6) = min < 1, — jj—^ — > involved 

in the IMH algorithm. It is clear that the replacement of O by G* is harmless for the 
ergodic properties of the Markov chain when C W'-. 

Remark 4. One may wonder why not just use the continuous part of ti^^~^{-) as the 
global proposal density within the IMH algorithm. In other words, why not use the 
density ^jcont(')5 which is proportional to the function defined by f|T2l) . as the proposal 
density. Indeed, in this case we would not have any difficulties with calculating the ratio 
^'^{Cj- The problem is that it is not clear how to sample from T^j^conti')^ while 
sampling from if^^'^{d6) = ^^f^ w^^liKj{d6\6j'l-^^) is straightforward. 

The above discussion leads to the following algorithm for sampling from the distribu- 
tion 7rj(-): 

AIMS at annealing level j 

Input : 

> ofli, • • • , ~ 7rj_i(-), samples generated at annealing level j — 1; 

> ef G e* = \ . . . , initial state of a Markov chain; 

> gj(-|^), symmetric proposal density associated with the RWMH kernel; 

> Nj, total number of Markov chain states to be generated. 
Algorithm: 

for i = 1, Nj — 1 do 

1) Generate a global candidate state $,g ~ tTj ^~^(-) as follows: 

8 



a. Select k from {1, . . . , Nj_i} with probabilities given by @ 

b. Generate a local candidate ~ 

c. Accept or reject ^/ by setting 

^i, with probability min 1 1, J , ^^^^ 
ef}^, with the remaining probability. 
2) Update ^j*^ — > ^j*'*'^'' by accepting or rejecting as follows: 

if = 

Set ef+'^ = ef 

else 

Set 

fl(m) J with probability mini 



of , with the remaining probability. 



end if 
end for 



Output : 

► of , . . . , Oj^^\ Nj states of a Markov chain with a stationary distribution 

Schematically, the AIMS algorithm at annealing level j is shown in Figure [21 The 
proof that 7ij{-) is indeed a stationary distribution for the Markov chain generated by 
AIMS is given in Section |3l 

Remark 5. As usually for MCMC algorithms, the fact of convergence of a Markov chain 
to its stationary distribution does not depend on the initial state; however, the speed of 
convergence does. One reasonable way to chose the initial state of G 0* in practical 

applications is the following: generate of ~ qj{-\Of_l), where k* = argmax^ i.e. 

Of_l has the largest normalized importance weight. 



2.2 The full AIMS procedure 

At the zero*^ annealing level, j = 0, we generate prior samples Ol^\ . . . , Oq^°\ which usu- 
ally can be readily drawn directly by a suitable choice of the prior distribution 7ro(-). 
Then, using the algorithm described in the previous subsection, we generate samples 
0^\ . . . , o['^^\ which are approximately distributed according to intermediate distribu- 
tion 711(0) oc txq{0)L{OY^ . We proceed like this until the posterior distribution TimiO) oc 
tiq{0)L{OY"^ {f3m = 1) has been sampled. To make the description of AIMS complete, we 
have to explain how to choose the annealing parameters for j = 2, . . . , m — 1. 

It is clear that the choice of the annealing parameters is very important, since, for 
instance, it affects the accuracy of the importance sampling approximation ([8]) and, there- 
fore, the efficiency of the whole AIMS procedure. At the same time, it is difficult to make 
a rational choice of the /3j-values in advance, since this requires some prior knowledge 
about the posterior distribution, which is often not available. For this reason, we propose 
an adaptive way of choosing the annealing scheme. 
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In importance sampling, a useful measure of degeneracy of the method is the effective 
sample size (ESS) A^'^'^ introduced in |KLW94] and |Li96] . The ESS measures how similar 
the importance sampling distribution 7ij_i{-) is to the target distribution vrj(-). Suppose 

Nj_i independent samples ofli, ■ ■ ■ .ofj'i^^ are generated from 7rj_i(-), then the ESS of 
these samples is defined as 

^eff _ = no) 



where w{6) = 7rj{9)/nj^i{9). The ESS can be interpreted as implying that Nj_i weighted 

samples {Ofli, wf}^), . . . , 
from the target distributi 
iV^^2 of N^-i is given by 



samples (6*]^^, ),..., ^\wj^{ are worth A'J^^(< Nj_i) i.i.d. samples drawn 
from the target distribution vrj(-). One cannot evaluate the ESS exactly but an estimate 



where Wj-i = {wfli, ■ ■ ■ , Wj^^'^^) and Wj'l-^ is the normalized importance weight of Oj^li- 
At annealing level j, when is already known, the problem is to define Pj. Let 
7 = N^^^/Nj^i G (0, 1) be a prescribed threshold that characterizes the "quality" of the 
weighted sample (the larger 7 is, the "better" the weighted sample is). Then we obtain 
the following equation: 

1)^-1^ (21) 



7iVi-i 



Observe that this equation can be expressed as an equation for [3j by using 



(E£r^(^fi; 



7iV, 



(22) 



Solving this equation for gives us the value of the annealing parameter at level j. 

Remark 6. Note that when j > 2, the Ofli, • • • , Oj^^'^"^ are generated by the Markov chain 
sampler described in the previous subsection and therefore are not independent. This 
means that, because of the autocorrelations produced by the Markov chain used, the 
"true" ESS of this sample is, in fact, smaller than the one given by (I19p . This is useful to 
remember when choosing 7. Also, this is another reason to select the prior distribution 
7ro(-) so that samples can be generated independently at the start of each AIMS run. 

Combining the AIMS algorithm at a given annealing level with the described adaptive 
annealing scheme gives rise to the following procedure. 

The AIMS procedure 

Input : 

> 7, threshold for the effective sample size (ESS); 

> A^o, A^^i, . . ., where Nj is the total number of Markov chain states to be generated 
at annealing level j; 

t> qi{-\0^Q2{-\0y ■ ■ where qj{-\^) is the symmetric proposal density associated with 
the RWMH kernel at annealing level j. 
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Algorithm: 

Set j = 0, current anneahng level. 

Set f^Q = 0, current annealing parameter. 

Sample 

Calculate Wlf^ = , z = 1, . . . , iVo- 

Calculate the ESS A^q*^ = Nf^iWo) using (l20l) . which measures how similar the 
prior distribution vro(-) is to the target posterior distribution 7r(-). 
while Nf/Nj < 7 do 

Find from equation (122|) . 

Calculate normalized importance weig hts wf, i = l,...,Nj usmg nyn. 

Generate a Markov chain ^j+i, . • • , ^j+i^^'' with the stationary distribution 
7rj_|_i(-) using the AIMS algorithm at annealing level j + 1. 

Calculate = i = 1, . . . , N,+i. 

Calculate the ESS iVJ^i = iVJ^i(iyj+i) using ([20]), which measures how 
similar the intermediate distribution vrj+i(-) is to the posterior 7r(-). 
Increment j to j + 1. 
end while 

Set = 1, current annealing parameter. 

Set m = j + 1, the total number of distributions in the annealing scheme. 
Set ^^U = iyili, ^ = l,...,iV„,_i. 

Generate a Markov chain 6m , • • • , Om"^^ with the stationary distribution 
T^mi') = 7r(-) using the AIMS algorithm at annealing level m. 
Output : 

► 6m , . . . , 6'i^™''~7r(-), samples that are approximately distributed according 
to the posterior distribution. 



2.3 Implementation issues 

As it follows from the description, the AIMS procedure has the following parameters: 7, 
the threshold for the effective sample size; Nj, the length of a Markov chain generated at 
annealing level j = 1, . . . ,m; and qj{-\^), the symmetric proposal density associated with 
the RWMH kernel at level j = 1, . . . , m. Here, we discuss the choice of these parameters 
and how this choice affects the efficiency of AIMS. 

First of all, it is absolutely clear that, as for any Monte Carlo method, the larger the 
number of generated samples is, the more accurate the corresponding estimates of ([1]) are. 
However, we would like to highlight the difference between the roles of A'^-i and Nj at 
annealing level j. While Nj is directly related to the convergence of the chain 6j^\ . . . , 6^^^^ 
to its stationary distribution 7ij{-), Nj_i affects this convergence implicitly through the 
global proposal distribution vr^- ^^^i-): the larger Nj^i, the more accurate approximation 
dH]) is, and, therefore, the less correlated 9j^\...,9j^^^ are. When iVj_i — )■ 00, samples 

9^^\ . . . , 9j^^^ become independent draws from vrj(-), hence the name of the algorithm. 
Thus, if we increase = A'^-i = A'^-, the effect is twofold: first, the sample size increases 
thereby increasing the effective number of independent samples at the j^^ level (typical for 
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any Monte Carlo method); second, the samples become less correlated (a useful feature of 
AIMS), again increasing the effective number of independent samples. As a result of these 
two effects, increasing has a strong influence on the effective number of independent 
posterior samples and so strongly reduces the variance of the estimator for ([T]). 

Suppose now that we are at the last annealing level and generating a Markov chain 
6m , . . . jOm"^^ with the stationary distribution 7rm{-) = vr(-). We will refer to this chain 
as the posterior Markov chain. A critical question faced by users of MCMC methods is 
how to determine when it is safe to stop sampling from the posterior distribution and 
use samples 6m\ ■ ■ ■ , 6^""^ for estimation. In other words, how large should Nm be? One 
possible solution of this "convergence assessment problem" is to use one of the numer- 
ous published diagnostic techniques; for example, see |CC96] for a comparative review of 
MCMC convergence diagnostics. Unfortunately, none of the published diagnostics allows 
one to say with certainty that a finite sample from an MCMC algorithm is representa- 
tive of an underlying stationary distribution. A more empirical approach for assessing 
convergence is to run several posterior Markov chains O^^l^, . . . , 0^^^\ k = 1, . . . ,K, in 

parallel and monitor the corresponding estimators hi, ... , Kk of E^[/i]. A stopping rule 
for convergence is then 

max \hi — hj\ < e, (23) 

l<i<j<K 

where £ is a minimum precision requirement. It is important to emphasise, though, that 
rule (l23ll . although easy-to-understand and easy-to-implement, does not assure conver- 
gence of the chains (especially if 7r(-) is multi-modal): "the potential for problems with 
multiple modes exists whenever there is no theoretical guarantee that the distribution is 
unimodal" [NeOlj . 

The threshold 7 affects the speed of annealing. If 7 is very small, i.e. close to zero, 
then AIMS will have very few intermediate distributions interpolating between the prior 
and posterior distributions, and this will lead to inaccurate results for a moderate number 
of samples. On the other hand, if 7 is very large, i.e. close to one, then AIMS will have 
too many intermediate distributions, which will make the algorithm computationally very 
expensive. 

The proposed method for finding /3j-values is based on the ESS, and /3j is defined 
from equation ( |2T1) (or, equivalently, from ( l22l) ). A similar adaptive approach for defining 
an annealing scheme was proposed in jCC07] . It is based on the coefficient of variation 
(COV) of the importance weights ([9]). More precisely, the equation for j3j is given by 




where 5 > is a prescribed threshold. It is easy to show that the ESS-criterion f l2T]) and 
the COV-criterion (12^ are mathematically equivalent; in fact, AJ^!i = Nj_i/{1 + 5^). We 
prefer to use the former criterion since 7 has a clear meaning: it is the factor by which the 
(essential) sample size of the weighted sample is reduced as a penalty for sampling from 
the importance sampling density instead of the target distribution. It has been found in 
|CC07] that 5 = 1 is usually a reasonable choice of the threshold. This corresponds to 
7 = 1/2. Our simulation results (see Section H]) also show that annealing schemes with 7 
around 1/2 yield good efficiency. 
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The choice of the local proposal density qj{-\^) associated with the RWMH kernel deter- 
mines the ergodic properties of the Markov chain generated by AIMS at level j; it also de- 
termines how efficiently the chain explores local neighborhoods of samples Ofli, ■ ■ ■ , Oj^i'^^ 
generated at the previous level. This makes the choice of qj{-\C,) very important. 

It has been observed by many researchers that the efficiency of Metropolis-Hastings 
based MCMC methods is not sensitive to the type of the proposal density; however, it 
strongly depends on its variance (e.g. |GRG96t lABOlj ). For this reason, we suggest using 
a Gaussian density as the local proposal: 

qMO=mO\^,cp), (25) 

where ^ and cjl are the mean and diagonal covariance matrix, respectively. The scaling 
parameter Cj determines the "spread" of the local proposal distribution. In Section [3l 
we prove (Theorem [3]) that, under certain conditions, the acceptance rate Aj (i.e. the 
expected probability of having a proper Markov transition to 6^^~^^^ ^ satisfies 

> ji, where constant M depends on qj{-\^) and, therefore, on c^. This result can 
be potentially used for finding an optimal cj that would minimize M. Alternatively, 
a more empirical way of choosing the scaling factor consists of adjusting c| based on 
the estimated acceptance rate. This works as follows: first, choose an initial value for 
the scaling factor, c|q, and estimate the corresponding acceptance rate Aj^cjo) based 
on Nj generated Markov states, then modify c^g to obtain an increase in Aj. Whether 
this optimization in is useful depends on whether the accuracy of the estimator that 
is achieved compensates for the additional computational cost. Finally, note that our 
simulation results show (see Section H]) that, as j increases, the corresponding optimal 
scaling factor c| decreases slightly. This observation coincides with intuition, since when 
j increases, the intermediate distributions vrj(-) become more concentrated. 

In the following section we establish the ergodic properties of the Markov chains gen- 
erated by AIMS. 



3 Ergodic properties of AIMS 

Since the discussion in Subsection 12. 1^ which motivated AIMS at annealing level j, in- 
volved delta functions and formal equalities f ll2p and f ll3p . we cannot simply rely on the 
convergence of the IMH algorithm in verification of AIMS; a rigorous proof is needed. 
First we prove that the described algorithm indeed generates a Markov chain with a 
stationary distribution TTj{-). We also explain that when the proposal density qj{-\^) is 
reasonably chosen, vrj(-) is the unique (and, therefore, limiting) stationary distribution of 
the corresponding Markov chain. 

Theorem 1. Let ef\ 9f\ ...be the Markov chain on 9* = 6 \ . . . , ^en- 

erated by the AIMS algorithm at annealing level j , then iTj{-) is a stationary distribution 
of the Markov chain. 

Proof. Let /Cj(-|-) denote the transition kernel of the Markov chain generated by AIMS 
at annealing level j. From the discription of the algorithm it follows that JCj{-\-) has the 
following form: 
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ICMm = E .) min (l, ^1 mm (l, ""'^'X-ul 



K -,(^fi)J I '-.W<^-^(e)J (26) 
{l-A,{e))5em. 



where Aj{6) is the probabihty of having a proper transition ^ to O* \ {6}: 
MO) - I E muJl. ^) „.i Jl, 




A sufficient condition for vrj(-) to be a stationary distribution is for ICj 
the detailed balance condition: 

7r^{dd)lCMm = 'K,{di)lC,{de\i) (28) 

Without loss of generality, we assume that 6 ^ C,, since otherwise (l28l) is trivial. In this 
case )Cj{dC,\0) is given by the ffist term in ( |26l) . since the second term vanishes. Thus, all 
we need to prove is that function 



^(^,0 E -?2,g,(e|^«,) min |l, ^5^1 min |l, ""'^^^ \ (29) 



is symmetric with respect to permutation 6 ^ C,, for all 6',^ G 6*. Taking into account 
(fT2l) and a simple fact that aminjl, b/a} = femin{l, a/b} for all a,b > 0, we have: 



(i)mm<l, 



(30) 



This proves that 7rj(-) is a stationary distribution of the AIMS Markov chain. □ 

A stationary distribution is unique and is the limiting distribution for a Markov chain, 
if the chain is aperiodic and irreducible (see, for example, |Ti94j ). In the case of AIMS, 
aperiodicity is guaranteed by the fact that the probability of having a repeated sample 
zero: for example, if the local candidate state is rejected in step Ic, 

then we automatically have 6j'~^^^ = 6^j\ A Markov chain with stationary distribution 
7r(-) is irreducible if, for any initial state, it has positive probability of entering any set to 
which 7r(-) assigns positive probability. It is clear that if the proposal distribution qj{-\C,) is 
"standard" (e.g. Gaussian, uniform, log-normal, etc), then AIMS generates an irreducible 
Markov chain. In this (•) is therefore the unique stationary distribution of the 

AIMS Markov chain, and for every 6 E Q* 

hm ||/C^«(.|^)-vr,.(.)||Tv = 0, (31) 
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with II ■ II TV denoting the total variation distance. Recall that the total variation dis- 
tance between two measures /ii(-) and /i2(-) on is defined as ||/ii(-) — /^i(-)I|tv = 
sup^ce 1/^1 (^) ^ /^2(^)|- In a simulation setup, the most important consequence of con- 
vergence property ( l3T|) is, of course, that the sample mean converges to the expectation 
of a measurable function of interest almost surely: 

hm ^Y.h{ef)= / hie)n,ie)de (32) 

Convergence f l3ip ensures the proper behavior of the AIMS chain 6j^\ 6j^\ . . . regardless 

of the initial state 6j^\ A more detailed description of convergence properties involves 
the study of the speed of convergence of /C"(-|^) to Evaluation (or estimation) of 

this speed is very important for any MCMC algorithm, since it relates to a stopping rule 
for this algorithm: the higher the speed of convergence /C"(-|^) — >■ vrj(-), the less samples 
are need to obtain an accurate estimate in (!32|) . Recall, following |MT09] . that a chain 
^(1)^ ^(2)^ ... is called uniformly ergodic if 

lim sup||/C"(-|^)-7r(-)||TV = (33) 

n-i>oo gg@ 

The property of uniform ergodicity is stronger than (13T]) . since it guarantees that the speed 
of convergence is uniform over the whole space. Moreover, a Markov chain is uniformly 
ergodic if and only if there exist r > 1 and R < oo such that for all 6* G 

||/C"(.|^)-7r(.)||TV<i?r^", (34) 
that is, the convergence in (133|) takes place at uniform geometric rate |MT09] . 
Theorem 2. // there exists a constant M such that for all 6' G 0* 

nj{9) <Mn^'-'{e), (35) 
then the AIMS algorithm at annealing level j produces a uniformly ergodic chain and 

mm-7r,{.)\Wy<(^l-^y (36) 

Proof. To prove the first part of the theorem we will need the notion of a small set |MT09] . 
A set A C is called a small set if there exists an integer m > and a non-trivial measure 
fim on 0, such that for all 6 & A, B G 0: 

/C"^(5|^) > /i™(S) (37) 

In this case we say that A is /Xm-small. It can be shown |MT09j that a Markov chain is 
uniformly ergodic if and only if its state space is /im-small for some m. Thus, to prove 
the theorem, it is enough to show that 0* is a small set. 
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If (153]) is satisfied, than the following holds for transition kernel (126 p for ^ G O* and 



B C e*: 



•^^ i=i L 



mm < 1 , — > 



7r 



7r,(^)rrf-(e) 



[ .N,^.,,, . I, vr,(Ovrf --(g) y 
^mm vr/ {iUAO-^\di 



{31 



B 



> I min <! n^' \^), 



M 



The sample space G* is therefore ^-small, and the corresponding Markov chain is uni- 
formly ergo die. 

To prove bound f l36|) . first observe, using fl38|) . that 

l|/C,(-|g) - 7r,(-)l|TV = sup \lC,{A\e) - 7r,(A)| < sup \ix,{A) - i-7r,(A)| = 1 - ^ (39) 

For 77, > 1, using the Chapman-Kolmogorov equation Ky-^'"'{A\9) = /g /C™(y4|^)/C'^((i,^|6') 
and stationarity of vrj(-) with respect to /Cj(-|-), we have: 

- 7r,(-)||TV = sup W;{A\e) - 7r,(A)| 



sup 

A 



sup 



sup 

A 



lC,{A\i)lCY\dm - / lC,{A\i)iT,{i)di 
e* ^e* 

/C,(A|0 [/Cf^(rfel^)-vr,(Orfe] 
[/C,(A|0 - vr,(A)] [/CJ-^(rfe|g) - ir,{i)d(\ 



(40) 



e* 



where the last equality follows from the fact that /c,* /C? ^{dE,\6) = fc,*T^j{^)d^ = 1- 
Finally, we obtain: 



l^i(-|^)-^.(-)l|TV<SUpSUp 

B A 



< sup 

B 



sup |/C,(A|0 - 7r,{A)\ [lC]-\dm - ^AOd^] 



B A 



(41) 



\}C,m - 7r,(-)l|TV ■ \\JC]-\-\e) - vr,(-)l|TV < 1 



M 



□ 



Remark 7. Note that if there exists a constant M such that (135!) holds for all 9 E Q*, 
then M > 1 automatically. 
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Corollary 1. // G C is a compact set and qj{-\0 ^■^ ^ Gaussian distribution centered 
at ^, then the AIMS algorithm at annealing level j produces a uniformly ergodic chain and 
123) holds with M given by 

Mj"f.,«^~<^V' (42, 



Proof. Let us show that in this case condition (j35|l is always fulfilled. For any 6* G 0* we 
have: 



1=1 I '^j\^j-lJ 



{d) 2^ w]\ ,^ mm <^ 1 



(43) 



TxMl^) \ ' maxeg©7rj(^) 



- min,geg,(g|^,-i) 
^^^^ max,,evr,W 



Thus, ([35]) holds with M given by ([12]). □ 

Remark 8. Note than the assumption of compactness of the sample space is not very 
restrictive and is typically satisfied in most Bayesian statistics problems. Indeed, to fulfill 
this condition, it is enough to take a prior distribution 7ro(-) with compact support. Next, 
it is clear from the proof, that the conclusion of Corollary [T] holds for different "reasonable" 
(not only Gaussian) proposal distributions qj{-\C)- Therefore, the AIMS algorithm will 
produce a uniformly ergodic Markov chain in many practical cases. 

It has been recognized for a long time that, when using an MCMC algorithm, it is 
useful to monitor its acceptance rate A, i.e. expected probability of having a proper 
Markov jump 6*^*^ to 6''^*+^^ 7^ 6^''\ While in the case of the RWMH algorithm, the finding 
of the optimal acceptance rate is a difficult problem: neither high nor low A is good 
|GRG96] : for IMH the picture is rather simple: the higher A, the better |RC04] . Since 
AIMS is based on the IMH algorithm, their properties are very similar. In particular, one 
should aim for the highest possible acceptance rate of the global candidate state when 
implementing AIMS. 

We finish this section with a result that provides bounds for the acceptance rate of the 
AIMS algorithms. These bounds can be useful for finding the optimal implementation 
parameters. 

Theorem 3. Let Aj be the expected probability of having a proper Markov transition 
associated with the AIMS algorithm at annealing level j. Then 

A < (44) 

i=l 
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where aj{6j'}_^) is probability / flT]) associated with having a proper transition under the 
RWMH transition kernel / fl^) . // ijiE^ holds, then 

A- > T7 (45) 

Proof. For every G 0*, the probability Aj{6) of transition ^ to 0* \ {6} is given by ( 1271 ). 
For its expected value we have: 

A, = I -n^Ame 

f [ E nnn f 1, ^} nnn (l, ] d^dO 



"^Qj i=i 1=1 
To prove the lower bound we use (IT^ in the equation defining Af 
A =/" I min 1 1, ""^'^^^^^ I cie^g 



(46) 



0* JO* 



,7V, 



vr,(0^r-^(^) / vr,(e)7r;-^(0 



(47) 



- JeJe; M l^frf-(0-7rf-(^),' 

_ 2 ^ / 7r,(0 ^ TT.ie) \ _ 1 



where the last probability is equal to 1/2, because 9 and ^ are i.i.d. according to 7rj(-), 
and hence the result. □ 

Remark 9. The AIMS algorithm at annealing level j has two accept/reject steps: one is 
for the local candidate (step Ic) and another is for the global candidate (step 2). 
The right-hand side of (jH]) is nothing else but the local acceptance rate, i.e. expected 
probability of generating a proper local candidate state ^ • • • , 0^'^{'^^}. Basically, 

says that the global acceptance rate Aj can never exceed the local acceptance rate. 
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In fact, it can be deduced directly from the description of the algorithm, since if the local 
candidate is rejected, then the global candidate is automatically rejected and we 
have a repeated sample 6^''^^^ = . 

4 Illustrative Examples 

In this section we illustrate the use of AIMS with three examples: 1) mixture of ten 
Gaussian distributions in two dimensions (a multi-modal case); 2) sum of two multivariate 
Gaussian distributions in higher dimensions; and 3) Bayesian updating of a neural network 
model. 

4.1 Multi-modal mixture of Gaussians in 2D 

To demonstrate the efficiency of AIMS for sampling from multi-modal distributions, con- 
sider simulation from a truncated two-dimensional mixture of M Gaussian densities: 

M 

7r{e) oc 7ro(^) ■ m = W[o,a]x[o,a](^) " «^^-^(^ 1/^- ^'^a) , (48) 

i=l 

where W[o,a]x[o,a](') denotes the uniform distribution on the square [0, a] x [0,a]. In this 
example, a = 10, M = 10, a = 0.1, wi = ... = wio = 0.1, and the mean vectors 
Hi, . . . ,/iio are drawn uniformly from the square [0, 10] x [0, 10]. Because of our interest 
in Bayesian updating, we refer to 7r(-) in ( 148|1 as a posterior distribution. 

Figure [3]^a) displays the scatterplot of 10^ posterior samples obtained from AIMS. No- 
tice there are two clusters of samples that overlap significantly near 6 = (4, 4) that refiect 
two closely spaced Gaussian densities but the other 8 clusters are widely spaced. The 
parameters of the algorithm were chosen as follows: sample size = 10^ per annealing 
level; the threshold for the ESS 7 = 1/2; the local proposal density qj{-\^) = A/'(-|^, CH2), 
with c = 0.2. The trajectory of the corresponding posterior Markov chain, i.e. the chain 
generated at the last annealing level with stationary distribution vr(-), is shown in Fig- 
ure [3t^b). Black crosses x represent the mean vectors /ii, . . . ,/iio- As expected, the chain 
does not exhibit a local random walk behavior and it moves freely between well-separated 
modes of the posterior distribution. 

The described implementation of AIMS leads to a total number of m = 6 intermediate 
distributions in the annealing scheme. Figure H] shows how annealing parameter Pj changes 
as a function oij for 50 independent runs of the algorithm. It is found that in all considered 
examples, Pj grows exponentially with j. 

Let us now compare the performance of AIMS with the Random Walk Metropolis- 
Hastings algorithm. For a fair comparison, the Metropolis-Hastings algorithm was imple- 
mented as follows. First, a sample of Nq = 10^ points ^q^'*, . . . , Oq^''^ was drawn from the 
prior distribution vro(-) = W[o,a]x[o,a](') and the corresponding values of the likelihood func- 
tion L{6) = ^fiiWif/{6\fii,a'^l2) were calculated, Li = L{9q^). Then, starting from the 
point with the largest likelihood, 6**-^'' = ^q'^'*, k = argmaxLj, a Markov chain 
with stationary distribution 7r(-) was generated using the Metropolis-Hastings algorithm. 
The proposal distribution used was q{-\^) = A^(-|^,c^l2) with c = 0.2, and the length of 
the chain was A^ = 5 ■ 10^. Thus, the total number of samples used in both AIMS and 
RWMH was Nt = 6 ■ 10^. The scatterplot of posterior samples obtained from RWMH 
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and the trajectory of the corresponding Markov chain are show in Figures [31^c) and|3]^d), 
respectively. While the AIMS algorithm successfully sampled all 10 modes with the ap- 
proximately correct proportion of total samples, RWHM completely missed 7 modes. 

Suppose that we are interested in estimating the posterior mean vector, /i'^ = (/i^, yUg); 
and the components (c"i ('^2 ^12 of the posterior covariance matrix S'^. Their true 
values are given in Table [U along with the AIMS estimates in terms of their means and co- 
efficients of variation averaged over 50 independent simulations, all based on 10"^ posterior 
samples. 

Figure [5] displays the mean square error (MSE) of the AIMS estimator for the posterior 
mean and covariance matrix for different values of the scaling factor c. The MSE was 
estimated based on 50 independent runs of the algorithm. An interesting observation is 
that the MSE as a function of c is nearly flat around the optimal, Copt ~ 0.15, i.e. the one 
that minimizes the MSE. 

4.2 Mixture of two higher-dimensional Gaussians 

To demonstrate the efficiency of AIMS for higher dimensionality, consider simulation from 
a truncated sum of two multivariate Gaussian densities: 

71^9) (X ■ L^(^) = ■ {^^i9\^^,,aX)+^^i9\^^2,aX)) , (49) 

where a = 2, /ii = (0.5, . . . , 0.5), ^2 = (— 0.5, . . . , — 0.5), and cr = 0.5. Thus, vr'^(-) 
is a bimodal distribution on a rf-dimensional cube [—a, a^. Suppose that a quantity 
of interest is the function h : [—a, [—a, a] that gives the largest component of 

9 = i9,,...,9d)e[-a,a]'': 

h{9) = max{9i,...,9d} (50) 

and we want to estimate its expectation with respect to vr'^(-) using posterior samples 
~ 7r'^(-) as follows: 

1 ^ 

/i = E,.[/i]^/i^ = -^/i(^«) (51) 

i=l 

This example is taken from |CC07] . where the Transitional Markov chain Monte Carlo 
method (TMCMC) for sampling from posterior densities was introduced. 

Here, we consider five cases: d = 2,4,6,10, and 20. The performance of TMCMC 
was examined for only the first three cases in |CC07] . The last two cases are higher 
dimensional, and, therefore, more challenging. 

The details of implementation and simulation results from 50 independent runs are 
summarized in Table O First of all, observe that AIMS outperforms TMCMC, when 
d = 2,4,6. Both methods are capable of generating samples from both modes of the 
posterior; however, the probabilities of the modes (each is 1/2 in this example) are found 
more accurately by AIMS. 

Remark 10. In addition to the first three cases, five other scenarios with different prob- 
abilities of modes and different values of a were examined in jCC07] . It is found that 
AIMS outperforms TMCMC in all these cases too. 
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Results presented in Table [2] help to shed some light on the properties of the optimal 
scaling parameter Copt for the proposal density qj{-\^) = J\f{-\^,cHd). It appears Copt 
depends not only on the dimension d, which is expected, but also on N, the number of 
samples used per each annealing level. The latter dependence is explained by the fact 
that the global proposal distribution 7rj^(-) for the AIMS Markov chain depends both on 
and c: vrj^(-) is a weighted sum of RWMH transition kernels with Gaussian proposal 
distributions, whose spread is controlled by c. When A^ is fixed, Copt is a monotonically 
increasing function of d, since in higher dimensions, for optimal local exploration of the 
neighborhoods of Of}i, . . . , Oj^l, we have to be able to make larger local jumps from 6']^\ 
to ^i. When d is fixed, Copt is a monotonically decreasing function of A^, since the more 
samples 6*]^^, . . . ,Oj^i that have been generated at the previous level, the more we can 
focus on local exploration of their neighborhoods without worrying too much about regions 
that lie far away. If we think of the support of = A/'(-|^]^\, c^Id) as lying mostly 

(k) 

in a (i- dimensional ball of radius c centered at then we can explain the dependence 
of Copt on A^ as follows: the more ci-dimensional balls of radius c we have, the smaller c 
we can use for covering the sample space. 

It is interesting to look at how the local and global acceptance rates (see Remark ^ 
depend on the scaling parameter c. Figures [6l [TJ and |8] display these acceptance rates 
along with the coefficient of variation S of the AIMS estimator for the first three cases: 
d = 2,4 and 6, based on 50 independent runs. As expected, the global acceptance rate is 
always smaller than the local acceptance rate, and the minimum value of 6 corresponds 
to the maximum value of the global acceptance rate. Observe also that the peak of the 
global acceptance rate slides to the left, when j increases. This suggests that it is more 
efficient to use smaller values of c at higher annealing levels. Indeed, it is natural to 
expect that c°^^ > c°?|.\, since the intermediate distribution 7rj+i(-) is more concentrated 
than vrj(-). 

Finally, we draw attention to Case 4 in Table 2 where d = 10 with A^ = 10^ and 
N = 2 ■ 10^ samples per annealing level. Usually for Monte Carlo based methods, the 
coefficient of variation 6 of the estimator is proportional to 1/ y/Nt, where A^ is the total 
number of samples. Thus, the doubling of sample size will result in the reduction of 6 by 
the factor of l/\/2 ^ 0.71. For AIMS, however, the decrease of 5 is more significant: from 
S = 26.7% to 6 = 12.2%, i.e. approximately by the factor of 0.46. This is because, as 
explained in Subsection 12.31 the increase of A^ affects not only the total sample size, but 
also improves the global proposal distribution 7rj^(-)- This improvement of ^j^(-) results 
in the generation of less correlated samples at each annealing level and, therefore, leads 
to an additional reduction of the coefficient of variation 6. 

4.3 Bayesian updating of a neural network 

To illustrate the use of AIMS for Bayesian updating, consider its application to a feed- 
forward neural network model, one of the most popular and most widely used models 
for function approximation. The goal is to approximate a (potentially highly nonlinear) 
function / : X — )■ M, where X C is a compact set, based on a finite number of 
measurements tji = f{xi), i = 1, . . . ,n, by using a finite sum of the form 

M 

fix,9) = J2a,^{{x,/3,)+j,) (52) 
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where 9 denotes the model parameters a^, 7^- e M and Pj G MP, (-, ■) is the standard scalar 
product in M^, and \1/ is a sigmoidal function, the typical choice being either the logistic 
function or the tanh function that is used in this example: 

*(^) = ^7-^- (53) 

+ e ^ 

Model ( 152|) is called a feed-forward neural network (FFNN) with activation function fl53l) . 
p input units, one hidden layer with M hidden units, and one output unit. The parameters 
I3j and aj are called the connection weights from the input units to the hidden unit j and 
the connection weights from the hidden unit j to the output unit, respectively. The term 
7j is a designated bias of the hidden unit j and it can be viewed as a connection weight 
from an additional constant unit input. Schematically, the FFNN model is shown in 
Figure [H 

The rationale behind the FFNN approximation method follows from the universal 
approximation property of FFNN models |Cy89 [HSW89j : that is, a FFNN with sufficient 



number of hidden units and properly adjusted connection weights can approximate most 
functions arbitrarily well. More precisely, finite sums f l52|) over all positive integers M are 
dense in the set of real continuous functions on the p-dimensional unit cube. 

Let A denote the FFNN architecture, i.e. the input-output model fl52l) together with 
information about the type of activation function number of input units p, and number 
of hidden units M . In this example, we use p = 1, M = 2, and \l/ is given by ([S3D, so the 
model parameters 6 = (ai, 02, /32, 7i, 72) G 6 = M^. 

Deterministic model A of function / given by /(x, 6) in (152!) can be used to construct 
a Bayesian (stochastic) model Ai of function / by stochastic embedding (see the details in 
[BeOStlBelOj ). Recall, that by definition, a Bayesian model Ai consists of two components: 

1. An input-output probability model y ~ p{y\x,6,Ai), which is obtained by intro- 
ducing the prediction-error 

e = y-f{x.e), (54) 

which is the difference between the true output y = f{x) and the deterministic model 
output f{x,9). A probabihty model for e is introduced by using the Principle of 
Maximum Entropy [Ja57l IJa03] , which states that the probability model should be 
selected to produce the most uncertainty subject to constraints that we wish to 
impose (the selection of any other probability model would lead to an unjustified 
reduction in the prediction uncertainty). In this example, we impose the following 
constraints: K[6] = and var[£:] = cr^ with e unbounded. The maximum entropy 
PDF for the prediction-error is then e ~ A/'(0,(T^). This leads to the following 
input-output probability model: 

p{y\x,9,M)=Af(^y\f{x,9),a'^ (55) 

Here, the prediction-error variance is included in the set of model parameters 
where, for convenience, we define 6i = logcr"^, so the parameter space is now 

e = M^. 

2. A prior PDF 7ro(6'|A^) over the parameter space which is chosen to quantify the 
initial relative plausibility of each value of ^ in 6. In this example, the prior distri- 
butions are assumed to be: 

a, ~Ar(0,O, /3, ~Ar(0,a2), 7,- ~ Ar(0, a^), 0^ = loga'^ ~ Ar(0, <), (56) 
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with (Tq = (T/3 = (J^ = (Te^ = 5. Thus, the prior PDF in our case is 

M 

Md\M) = A/'(^7|0, H-^K-IO' |0, ^^)Ar(7,-|0, a^). (57) 

i=i 

Let P denote the training data, V = {{xi,yi), . . . , (x„, y„)}, treated as independent 
samples, then the hkehhood function which expresses the probabihty of getting data V 
based on the probabihty model ( 155|) is given by 

n 

L{e)=p{V\e,M) = l[p{yi\x,,e,M) (58) 

i=l 

In this example, data are synthetically generated from (155|) with cti = 5, 02 = —5, 
(3i = —1, /32 = —3, 7i = 5, 72 = 2, cr = 0.1, and the input Xi = i/10, for i = 1, . . . , n = 100. 

Finally, using Bayes' theorem, we can write the posterior PDF 7r{6\V, M) for the 
uncertain model parameters: 

7r{9\V,M) (xno{e\M)- L{e) 

= Af{er\0,al)]jAf{a,\0,al)Af{P,\0,aj)Af{j,\0,a^^ ^ ' 

j=i i=i 

Under the Bayesian framework, the mean prediction oi y = f{x) from observable x 
can be obtained by integrating out the nuisance parameters: 

¥.^[y\x,V,M]= ! f{x,e)7t{9\V,M)d9 (60) 
Je 

To demonstrate the efficiency of AIMS for the mean prediction problem, we use it to 
sample from the posterior PDF ( 159|) and use Monte Carlo simulation in ( I6OI) . The param- 
eters of the AIMS algorithm are chosen as follows: sample size = 3 x 10^ per annealing 
level; the threshold for the ESS 7 = 1/2; the proposal density qj{-\^) = A/'(-|^, c^I?), with 
c = 0.5. This implementation of AIMS leads to a total number of m = 10 intermediate 
distributions in the annealing scheme. The obtained posterior samples 6m\ ■ ■ ■ ,0m are 
then used to approximate the integral on the right-hand side of (M?]) : 

N _ 

/ fix, 9)n{9\V, M)d9 ^ ^ 5^ fix, 9^) = U^) (61) 
-^0 i=i 

The true function y = f{x) as well as its AIMS approximation fm{x) are shown in Fig- 
ure [THl A few "intermediate approximations" fj{x), which are based on 9j^\ . . . , 9^p ~ vTj, 

are plotted to show how fj{x) approaches f{x) when j — m. To visualize the uncertainty 
for the AIMS approximation, we plot its 5th and 95th percentiles in Figure [TTl 

5 Concluding Remarks 

In this paper, a new scheme for sampling from posterior distributions, called Asymptot- 
ically Independent Markov Sampling (AIMS), is introduced. The algorithm is based on 
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three well-established and widely-used stochastic simulation methods: importance sam- 
pling, MCMC, and simulated annealing. The key idea behind AIMS is to use samples 
drawn from 7ij_i{-) as an importance sampling density to construct an approximation 
7rj^(-) of vrj(-), where vro(-), . . . ,vrm(-) is a sequence of intermediate distributions interpo- 
lating between the prior vro(-) and posterior 7r(-) = iVmi-)- This approximation is then 
employed as the independent proposal distribution for sampling from '/rj(-) by the inde- 
pendent Metropolis-Hastings algorithm. When N ^ oc, the AIMS sampler generates 
independent draws from the target distribution, hence the name of the algorithm. 

Important ergodic properties of AIMS are derived. In particular, it is shown that, 
under certain conditions (that are often fulfilled in practice), the AIMS algorithm produces 
a uniformly ergodic Markov chain. The choice of the free parameters of the algorithm 
is discussed and recommendations are provide for their values, both theoretically and 
heuristically based. The efficiency of AIMS is demonstrated with three examples, which 
include both multi-modal and higher-dimensional target posterior distributions. 
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Parameter 










'^12 


True value 


5.23 


5.75 


4.51 


3.37 


-1.30 


AIMS mean 


5.20 


5.73 


4.56 


3.32 


-1.25 


AIMS GOV 


2.4% 


2.0% 


8.2% 


8.2% 


27.7% 



Table 1 : True values of the posterior parameters and the AIMS estimates in terms of their means and 
coefEcients of variation averaged over 50 simulations [Example I4.1j . 



Case 


d 


h 


TMCMC: hN, (6) 


AIMS: hN, (S) 


N 


7 


Copt 


m 


1 


2 


0.29 


0.28 (12.3%) 


0.29 (8.8%) 


10^ 


1/2 


0.2 


3 


2 


4 


0.51 


0.54 (10.0%) 


0.51 (6.9%) 


10^ 


1/2 


0.4 


4 


3 


6 


0.64 


0.65 (15.7%) 


0.64 (10.4%) 


10^ 


1/2 


0.6 


4.95 


4 


10 


0.76 




0.76 (26.7%) 


10^ 


1/2 


0.7 


5.84 




10 


0.76 




0.76 (12.2%) 


2 ■ 10^ 


1/2 


0.6 


5.98 


5 


20 


0.92 




0.95 (42.1%) 


4- 10^ 


1/2 


0.5 


5.58 



Table 2: Summary of the simulation results: d is the dimension of the sample space; h and /itv are the 
exact value of E^d [h] and its estimated value, respectively; S in parentheses is the corresponding coefficient 
of variation; N, 7, Copt, and ffi are the number of samples used per annealing level, the threshold for 
the ESS, the (nearly) optimal value of the scaling parameter, and the average number of distributions in 
the annealing scheme, respectively. The AIMS results are based on 50 independent runs. The TMCMC 
results are taken from |CC07| and are based on 50 independent runs [Example 14. 2| . 
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Figure 1: The top panels show the distribution 7rj(-) (solid lines) and its approximation tt^- ^ ^(•), for 
iVj_i = 5 (left) and iVj-i = 50 (right). Dashed hnes and bars correspond to the continuous and discrete 



parts of TT - ^~^(-)i respectively. The bottom panel shows the convergence of h\(Nj-\) = E.jVj_i [hi] and 
h2{Nj-i) = E.iVj_i [/i2] to the true values, and 1, respectively [Example 2.1]. 
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respectively; concentric circles show the correspondence between OjJi that has been chosen in step la and 

the corresponding local candidate ~ '/('l^'j-i) that has been generated in step lb. In this schematic 
picture, all shown candidate states are accepted as new states of the Markov chain. 
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Figure 3: (a) Scatterplots of 10^ posterior samples; (b) the trajectories of the corresponding posterior 
Markov chain obtained from AIMS; and (c), (d) corresponding plots from RWMH. Black crosses x 
represent the modes /ii, . . . , /iio of 7r(-) [Example 4.1]. 
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Annealing level j 

Figure 4: Annealing parameter (3j as a function of annealing level j for 50 independent runs of AIMS 
[Example 4.1]. 
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Figure 6: Coefficient of variation 5 of the AIMS estimate (top panel), global aeceptance rate (middle 
panel), and local acceptance rate (bottom panel) as functions of c for Case 1 (d = 2) [Example 14 .21 . 
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Figure 7: Coefficient of variation 5 of the AIMS estimate (top panel), global acceptance rate (middle 
panel), and local acceptance rate (bottom panel) as functions of c for Case 2 (d = 4) [Example 14 .21 . 
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Figure 8: Coefficient of variation 5 of the AIMS estimate (top panel), global acceptance rate (middle 
panel), and local acceptance rate (bottom panel) as functions of c for Case 3 (d = 6) [Example [ 

35 



Figure 9: The feed-forward neural network model with one hidden layer (shown by hatching) [Exam- 
ple 4.3]. 
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Figure 10: The true function f{x) (solid curve), its posterior approximation fwix) (dashed curve) 
which is constructed using AIMS, and "intermediate annealing approximations": fo{x) (dotted curve) 
which is based on prior samples, /2(a;) and fs^x) (dashed-dotted curves) [Example 4.3]. 
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Figure 11: The true function f{x) (solid curve), its AIMS approximation fio{x) (dashed curve), and 
5th and 95th percentiles of fio{x) (dotted curves) [Example 4.3]. 
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